We show how to achieve a statistical description of the hierarchical structure of a multivariate 
data set. Specifically we show that the similarity matrix resulting from a hierarchical clustering 
procedure is the correlation matrix of a factor model, the hierarchically nested factor model. In this 
model, factors are mutually independent and hierarchically organized. Finally, we use a bootstrap 
based procedure to reduce the number of factors in the model with the aim of retaining only those 
factors significantly robust with respect to the statistical uncertainty due to the finite length of data 
records. 
Many complex systems observed in the physical, biological and social sciences are organized in a nested hierarchical structure, i.e. the elements of the system can be partitioned in clusters which in turn can be partitioned in subclusters and so on up to a certain level [...]. Several examples of hierarchically organized physical [...], biological [...] and social [...] systems have been investigated in the literature. The hierarchical structure of interactions among elements strongly affects the dynamics of complex systems. Therefore, a quantitative description of hierarchical properties of the system is a key step in the modeling of complex systems. In this letter, we address the problem of inferring
a factor model from a multivariate data set. A factor 
model is a mathematical model which attempts to ex- 
plain the correlation between a large set of variables in 
terms of a small number of underlying factors. A major assumption of factor analysis is that it is not possible to observe these factors directly; the variables depend upon the factors but are also subject to random errors [13]. We show that the factor model we introduce
fully describes the hierarchical structure of interactions 
among elements of the complex system. Such a struc- 
ture is elicited by hierarchical clustering of multivariate 
data. The analysis of multivariate data provides cru- 
cial information in the investigation of a wide variety of 
systems. Multivariate analysis methods are designed to 
extract information both on the number of main factors 
characterizing the dynamics of the investigated system 
and on the composition of the groups (clusters) in which 
the system is intrinsically organized. Recently, physicists started to contribute to the development of new multivariate techniques (e.g. [...]). Among multivariate techniques, natural candidates for detecting the hierarchical structure of a set of data are hierarchical clustering methods [19]. These methods associate a dendrogram with a correlation matrix (or more generally with a similarity matrix), i.e. they give a schematic description of hierarchies. It is worth pointing out that the whole information contained in the dendrogram can be stored in a filtered similarity matrix $C^<$ [...]. The matrix $C^<$ has well defined metric properties. When the matrix $C^<$ of elements $\rho^<_{ij}$ is obtained by starting from a correlation matrix, then the matrix of distances $d^<_{ij} = \sqrt{2(1 - \rho^<_{ij})}$ has ultrametric properties.
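The ultrametric property stated above can be checked numerically on a small example. The sketch below is only illustrative: the 4 x 4 matrix `C` is a hypothetical filtered matrix (not taken from the data of this letter), and the helper names `distance` and `is_ultrametric` are assumptions introduced here. It converts each correlation into the distance $\sqrt{2(1-\rho)}$ and tests the inequality $d_{ij} \le \max(d_{ik}, d_{kj})$ on every triple.

```python
import itertools
import math

# Hypothetical filtered similarity matrix for 4 elements: elements 0 and 1
# merge at rho = 0.8, element 2 joins them at 0.5, element 3 joins at 0.2.
C = [
    [1.0, 0.8, 0.5, 0.2],
    [0.8, 1.0, 0.5, 0.2],
    [0.5, 0.5, 1.0, 0.2],
    [0.2, 0.2, 0.2, 1.0],
]

def distance(rho):
    """Distance d = sqrt(2 (1 - rho)) associated with a correlation rho."""
    return math.sqrt(2.0 * (1.0 - rho))

def is_ultrametric(corr, tol=1e-12):
    """Check d_ij <= max(d_ik, d_kj) for every triple of distinct elements."""
    n = len(corr)
    d = [[distance(corr[i][j]) for j in range(n)] for i in range(n)]
    return all(
        d[i][j] <= max(d[i][k], d[k][j]) + tol
        for i, j, k in itertools.permutations(range(n), 3)
    )

# A generic correlation matrix (not produced by a dendrogram) usually fails.
NOT_FILTERED = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.9],
    [0.1, 0.9, 1.0],
]

print(is_ultrametric(C), is_ultrametric(NOT_FILTERED))
```

The second matrix is included only to show that the test is not trivially satisfied by arbitrary correlation values.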

In this letter, we answer the following scientific ques- 
tion: given a multivariate data set is it possible to con- 
struct a factor model retaining the whole information 
about hierarchies which is detected by a hierarchical clus- 
tering? In the following, we show that it is possible to 
give a description of hierarchies detected by hierarchi- 
cal clustering in terms of a factor model, termed Hier- 
archically Nested Factor Model (HNFM). This model is 
constructed in such a way that its correlation matrix co- 
incides with the similarity matrix filtered by the cho- 
sen hierarchical clustering procedure. Furthermore, for a 
hierarchical clustering performed by estimating a correlation matrix from an empirical data set which is unavoidably of finite size, i.e. a set of N elements each characterized by a number T of records, we provide a bootstrap-based methodology to remove from the model those factors which are characterized by a statistical reliability smaller than a predefined standard threshold, e.g. 95%. In this letter, we consider time series; however, the results are general and also valid for any investigation of multivariate data. There are many clustering algorithms; here we use the Average Linkage Cluster Analysis (ALCA). However, we wish to point out that our technique can be used with most clustering algorithms giving a dendrogram [25], such as, for example, the single linkage clustering algorithm.

FIG. 1: Illustrative example of a rooted tree associated with a system of N = 10 elements (leaves in the tree). The symbols $\{\alpha_1, \ldots, \alpha_9\}$ label the N - 1 = 9 internal nodes.

Hereafter, we provide a methodology to associate a nested factor model with a multivariate data set. The association is done by retaining all the information about the hierarchies detected by a hierarchical clustering. This is achieved by considering a factor model in bijective relation with a dendrogram (or with the filtered matrix $C^<$), which is the output of a hierarchical clustering. We introduce our method by making use of the illustrative dendrogram given in Fig. 1. A dendrogram is a rooted tree, i.e. a tree in which a special node (the root) is singled out. In our example this node is $\alpha_1$. In the rooted tree, we distinguish between leaves and internal nodes. Specifically, vertices of degree 1 represent leaves (vertices labeled 1, 2, ..., 10 in Fig. 1), while vertices of degree greater than 1 represent internal nodes (vertices labeled $\alpha_1, \alpha_2, \ldots, \alpha_9$ in Fig. 1).
We associate a genealogy $G(i)$ ($G(\alpha_h)$) with each leaf $i$ (internal node $\alpha_h$). The genealogy is the ordered set of internal nodes connecting leaf $i$ (internal node $\alpha_h$) to the root $\alpha_1$. For instance, in Fig. 1 the genealogy associated with the leaf 3 is $G(3) = \{\alpha_7, \alpha_2, \alpha_1\}$ and the genealogy of the internal node $\alpha_7$ is $G(\alpha_7) = \{\alpha_7, \alpha_2, \alpha_1\}$. Note that the internal node $\alpha_7$ is included in $G(\alpha_7)$. Finally, we say that an internal node $w$ is the parent of the node $v$, and we use the notation $w = g(v)$, if $w$ immediately precedes $v$ on the path from the root to $v$. For example, $\alpha_2 = g(\alpha_7)$ in Fig. 1. Besides the topological structure, dendrograms obtained through standard hierarchical clustering algorithms applied to a correlation matrix also have metric properties. In fact, clustering algorithms associate a correlation coefficient $\rho_{\alpha_i}$ with each internal node $\alpha_i$ [19]. Our internal node labeling implies that $\rho_{\alpha_i} \le \rho_{\alpha_{i+1}}$ and here we consider $\rho_{\alpha_1} \ge 0$ [26]. The whole information about the rooted tree is stored in the N x N matrix $C^<$ of elements $\rho^<_{ij} = \rho_{\alpha_k}$, where $\alpha_k$ is the first internal node in which leaves $i$ and $j$ are merged together [...]. For example, in Fig. 1 leaves 2 and 3 first merge at the node $\alpha_2$, so that $\rho^<_{23} = \rho_{\alpha_2}$. In $C^<$ there are at most N - 1 distinct coefficients. Exactly N - 1 distinct coefficients are obtained in the case of binary rooted trees. Since any rooted tree can be obtained from a rooted binary tree by introducing a degeneracy of nodes, in the following we consider binary rooted trees.

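Here we show that the matrix $C^<$ is the correlation matrix of a HNFM defined as

$$x_i(t) = \sum_{\alpha_h \in G(i)} \gamma_{\alpha_h} f^{(\alpha_h)}(t) + \eta_i \epsilon_i(t), \qquad (1)$$

where $i \in \{1, \ldots, N\}$ and $\eta_i = [1 - \sum_{\alpha_h \in G(i)} \gamma_{\alpha_h}^2]^{1/2}$. The h-th factor $f^{(\alpha_h)}(t)$ and $\epsilon_i(t)$ are independent identically distributed (i.i.d.) random variables with zero mean and unit variance. In order to ensure that the correlation matrix of the model of Eq. (1) is $C^<$, the $\gamma$ parameters need to be chosen as

$$\gamma_{\alpha_1} = \sqrt{\rho_{\alpha_1}}, \qquad \gamma_{\alpha_h} = \sqrt{\rho_{\alpha_h} - \rho_{g(\alpha_h)}} \quad \forall h = 2, \ldots, N - 1, \qquad (2)$$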



where, assuming $\rho_{\alpha_1} \ge 0$, all the coefficients $\gamma_{\alpha_h}$ are non negative real numbers. Hereafter we show that the matrix $C^<$ is the correlation matrix of the factor model of Eq. (1) with the coefficients $\gamma$ given in Eq. (2). Let us consider a generic pair of elements $i$ and $j$ merging together at the node $\alpha_k$ corresponding to the correlation level $\rho_{\alpha_k}$. We prove that the cross correlation $\langle x_i x_j \rangle$ equals the correlation $\rho^<_{ij} = \rho_{\alpha_k}$. In fact, the cross correlation $\langle x_i x_j \rangle$ depends only on the factors $f^{(\alpha_h)}$ which are common to $x_i$ and $x_j$. Since we associate a factor with each internal node, we need to identify the internal nodes belonging to both the genealogies $G(i)$ and $G(j)$. One can verify that $G(i) \cap G(j) = G(\alpha_k)$. For example, in Fig. 1 we have that $G(2) = \{\alpha_9, \alpha_2, \alpha_1\}$ and $G(3) = \{\alpha_7, \alpha_2, \alpha_1\}$ so that $G(2) \cap G(3) = \{\alpha_2, \alpha_1\} = G(\alpha_2)$. By making use of Eqs. (1) and (2), the cross correlation between variables $x_i$ and $x_j$ is

$$\langle x_i x_j \rangle = \sum_{\alpha_h \in G(i) \cap G(j)} \gamma_{\alpha_h}^2 = \sum_{\alpha_h \in G(\alpha_k)} \gamma_{\alpha_h}^2 = \rho_{\alpha_k}, \qquad (3)$$

where the last equality holds because the sum telescopes along the path from $\alpha_k$ to the root. For example, with reference to Fig. 1 we have $\langle x_2 x_3 \rangle = \gamma_{\alpha_2}^2 + \gamma_{\alpha_1}^2 = \rho_{\alpha_2} - \rho_{\alpha_1} + \rho_{\alpha_1} = \rho_{\alpha_2}$. Thus the matrix $C^<$ is the correlation matrix associated with the factor model of Eq. (1). It is worth noting that the matrix $C^<$ is positive definite, because, as we have shown, $C^<$ is the correlation matrix of a factor model. In conclusion, the HNFM is a factor model taking into account the hierarchical properties of the investigated system which are elicited from data by hierarchical clustering.
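The construction above can be traced on a toy tree. The following is a minimal sketch, not the authors' code: the 4-leaf tree, the node names `a1`, `a2`, `a3`, the correlation levels in `rho` and all function names are hypothetical choices made for illustration. It computes the squared loadings from the parent relation, sums them over the factors shared by two leaves, and compares the result with the correlation level of the node where the two leaves first merge.

```python
import math

# Hypothetical binary rooted tree with 4 leaves and 3 internal nodes.
# node_parent maps each internal node to its parent g(node); the root has None.
node_parent = {"a1": None, "a2": "a1", "a3": "a2"}
leaf_parent = {1: "a3", 2: "a3", 3: "a2", 4: "a1"}
rho = {"a1": 0.2, "a2": 0.5, "a3": 0.8}  # non-decreasing away from the root

def genealogy_of_node(node):
    """Ordered list of internal nodes from `node` (included) up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = node_parent[node]
    return path

def genealogy_of_leaf(leaf):
    """Genealogy G(i) of a leaf: its parent node followed by all ancestors."""
    return genealogy_of_node(leaf_parent[leaf])

def loading_squared(node):
    """Squared loading: rho of the root, else rho minus the parent's rho."""
    parent = node_parent[node]
    return rho[node] if parent is None else rho[node] - rho[parent]

def eta(leaf):
    """Idiosyncratic coefficient that gives the variable unit variance."""
    total = sum(loading_squared(node) for node in genealogy_of_leaf(leaf))
    return math.sqrt(1.0 - total)

def model_correlation(i, j):
    """Cross correlation: sum of squared loadings over shared factors."""
    shared = set(genealogy_of_leaf(i)) & set(genealogy_of_leaf(j))
    return sum(loading_squared(node) for node in shared)

def merge_node(i, j):
    """First internal node (closest to the leaves) at which i and j merge."""
    ancestors_j = set(genealogy_of_leaf(j))
    return next(node for node in genealogy_of_leaf(i) if node in ancestors_j)

for i, j in [(1, 2), (1, 3), (2, 4)]:
    print(i, j, merge_node(i, j), round(model_correlation(i, j), 12))
```

Representing the tree by parent pointers keeps the genealogy lookup a simple walk to the root, which mirrors the definition of $G(i)$ used in the text.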

It is worth pointing out that the simple investigation 
of the eigenvectors of the correlation matrix is not al- 
ways suitable to detect the hierarchical structure and 
the group composition of the system. When the corre- 
lation matrix is block diagonal, the eigenvalue spectrum 
has a number of large eigenvalues equal to the number 
of groups. Moreover, each corresponding eigenvector has 
non vanishing components only for the elements of a specific group. In this case spectral analysis directly identifies a partition of the variables. However, these properties no longer hold when the system is intrinsically hierarchically organized. In fact, the number of large eigenvalues can be different from the number of groups and the eigenvectors of the correlation matrix associated with large eigenvalues have in general all non vanishing components, i.e. large eigenvalues cannot be associated with specific groups of variables. In the supplementary material of this paper we describe in detail two simple HNFMs for which the direct eigenvectors' analysis fails in identifying the groups and in unveiling the hierarchical structure of the system. This result suggests that, when the nested nature of groups of elements is significant, the largest eigenvalues cannot be associated either with specific groups of elements controlled by the same factors or with a common behavior mode governing all elements of the system. To make a specific example, consider a financial market. It has been recently suggested that there is a one to one association of the largest eigenvalues of the correlation matrix of stock returns with the global market behavior [21, 22, 23] or with specific economic sectors [...]. If the financial market is hierarchically organized (as proposed below), this association might be less straightforward than originally thought (see also Ref. [...]). In conclusion, basic spectral methods, such as principal component analysis, could be unable to fully describe the nested nature of hierarchical complex systems. For these cases our HNFM guarantees a proper hierarchical description of the elements of the investigated complex system. To the best of our knowledge, HNFM is the first model based on empirical data in which both the dependency of variables on factors is nested and the factors are independent of one another. This choice allows one to consider the hierarchical clustering procedure from a perspective which is different from the one which is commonly adopted. Hierarchical clustering is not only a tool used to extract a partition of the elements; it can also be used to associate with each element of the system a set of factors directly controlled by the genealogy of the element in the considered dendrogram. We believe this approach is useful in all the cases where a partition of the complex system is not straightforwardly feasible due to the fact that the system is clearly characterized by nested levels of hierarchies.

Eq. (1) defines a HNFM of N - 1 factors obtained from a dendrogram of N elements. In general the number of factors determining the dynamics of the system can be significantly smaller than N - 1. Moreover, several studies based on random matrix theory [21, 22] have shown that a correlation coefficient matrix obtained from a finite multivariate time series carries an unavoidable statistical uncertainty that does not allow one to discriminate between real and spurious factors. To overcome this problem, we propose here a method devised to select the HNFM characterized by the largest number of factors (although in any case less than N) compatible with a predefined threshold of statistical reliability of retained factors. Our method exploits the technique of non parametric bootstrap [27] which is widely used in phylogenetic analysis.

The method is illustrated below after we briefly sketch the procedure used to associate a bootstrap value with each internal node of a dendrogram. Consider a system of N time series of length T and suppose the data are collected in a matrix X with N columns and T rows. A bootstrap data matrix X* is formed by randomly sampling T rows from the original data matrix X, allowing multiple sampling of the same row. For each replica X*, the associated correlation matrix C* is evaluated and a dendrogram is constructed by hierarchical clustering. A large number (typically 1000) of independent bootstrap replicas is generated and for each internal node of the original data dendrogram we compute the fraction of bootstrap replicas (commonly referred to as bootstrap value) preserving the internal node in the dendrogram. Given an internal node $\alpha_h$ of the original dendrogram, we say that a bootstrap replica preserves that node if and only if a node $\alpha^*_h$ in the replica dendrogram exists and identifies a branch characterized by the same leaves identified by $\alpha_h$ in the original dendrogram. For instance, we say that the node $\alpha_3$ of the dendrogram in Fig. 1 is preserved in some replica dendrogram D* if and only if a node of D* exists such that it belongs to the genealogy of all and only the leaves 5, 6, 7, 8, 9 and 10. The bootstrap technique thus associates a bootstrap value with each internal node of a dendrogram. Because of the one to one relation between nodes in the dendrogram and factors in the HNFM, the bootstrap value associated with a certain node of the dendrogram is associated also with the corresponding factor in the HNFM.
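The bootstrap procedure just described can be sketched in a few dozen lines. This is an illustrative implementation under stated assumptions, not the code used for the results of this letter: it uses only the Python standard library, a naive average linkage that merges the pair of clusters with the largest mean correlation, synthetic two-factor data, 100 replicas instead of 1000, and hypothetical function names. Each internal node is represented by the set of leaves below it, so a node is preserved in a replica exactly when the same leaf set appears there.

```python
import math
import random

def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of `rows` (T rows, N columns)."""
    t, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / t for j in range(n)]
    dev = [[r[j] - means[j] for j in range(n)] for r in rows]
    norm = [math.sqrt(sum(d[j] ** 2 for d in dev)) for j in range(n)]
    return [[sum(d[i] * d[j] for d in dev) / (norm[i] * norm[j])
             for j in range(n)] for i in range(n)]

def average_linkage_nodes(corr):
    """Internal nodes of the average linkage dendrogram, as frozensets of leaves."""
    clusters = [frozenset([i]) for i in range(len(corr))]
    nodes = set()
    while len(clusters) > 1:
        # Merge the pair of clusters with the largest average correlation.
        a, b = max(
            ((p, q) for k, p in enumerate(clusters) for q in clusters[k + 1:]),
            key=lambda pq: sum(corr[i][j] for i in pq[0] for j in pq[1])
            / (len(pq[0]) * len(pq[1])),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        nodes.add(a | b)
    return nodes

def bootstrap_values(rows, replicas=100, seed=0):
    """Fraction of bootstrap replicas preserving each node of the original tree."""
    rng = random.Random(seed)
    original = average_linkage_nodes(correlation_matrix(rows))
    counts = {node: 0 for node in original}
    for _ in range(replicas):
        sample = [rng.choice(rows) for _ in rows]  # T rows drawn with replacement
        replica = average_linkage_nodes(correlation_matrix(sample))
        for node in original:
            if node in replica:
                counts[node] += 1
    return {node: count / replicas for node, count in counts.items()}

# Synthetic data (assumption): two groups of 3 series, each driven by one factor.
data_rng = random.Random(1)
rows = []
for _ in range(200):
    f1, f2 = data_rng.gauss(0, 1), data_rng.gauss(0, 1)
    rows.append([f1 + 0.5 * data_rng.gauss(0, 1) for _ in range(3)]
                + [f2 + 0.5 * data_rng.gauss(0, 1) for _ in range(3)])

values = bootstrap_values(rows)
for node, value in sorted(values.items(), key=lambda kv: len(kv[0])):
    print(sorted(node), value)
```

The root contains all leaves and is therefore preserved in every replica; the nodes inside each group are expected to be less stable, because the series of a group are statistically equivalent in this synthetic example.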

Since the bootstrap value is a measure of the node's (factor's) reliability, we propose to remove those nodes (factors) with bootstrap value smaller than a given threshold b. This is done by merging each node with a bootstrap value smaller than b with its first ancestor node in the path to the root having a bootstrap value greater than b and then by constructing the HNFM associated with this reduced dendrogram. The question is how to select a suitable threshold b. The bootstrap value of a certain node (factor) cannot be straightforwardly interpreted as the probability that the node (factor) belongs to the true and unknown hierarchy (model) of the system. For example, in phylogenetic analysis it has been shown [28] that a bootstrap value of more than 70% corresponds to a probability of more than 95% that the true phylogeny has been found. By adapting the technique of Hillis and Bull [28], we do not choose a priori the value of b but we infer a suitable value of the threshold from the data in a self consistent way. Specifically, we choose
a certain number of bootstrap value thresholds $b_i$, e.g. $b_i = (i \times 10)\%$, $i \in \{0, 1, \ldots, 10\}$. For each value of $i$, we remove internal nodes from the dendrogram according to $b_i$, obtaining a reduced dendrogram $D_i$ and a corresponding HNFM labeled HNFM$_i$. For each value of $i$, we perform $n$ simulations of data according to HNFM$_i$ and we label $X_{ik}$, with $k \in \{1, \ldots, n\}$, the data matrix of each simulation [29]. To each $X_{ik}$ we apply the clustering algorithm and the bootstrap node removal with the same threshold $b_i$, obtaining a reduced dendrogram $D_{ik}$. In order to compare the reduced dendrogram $D_i$ of the original data with the reduced dendrogram $D_{ik}$ of the data simulation we measure the sensitivity $Sn$ and specificity $Sp$ (see, for instance, [30]). In our case, the sensitivity $Sn_{ik}$ is the number of nodes in $D_i$ that are preserved in the reduced dendrogram $D_{ik}$ divided by the total number of nodes in the reduced dendrogram $D_i$. The specificity $Sp_{ik}$ is the number of nodes in $D_i$ that are preserved in the reduced dendrogram $D_{ik}$ divided by the total number of nodes in the reduced dendrogram $D_{ik}$. By averaging $Sn_{ik}$ and $Sp_{ik}$ over the $n$ different simulations we obtain the sensitivity $Sn_i$ and specificity $Sp_i$ of the node reduction associated with each bootstrap value threshold $b_i$. Finally, we obtain a measure of reliability of the dendrogram $D_i$ and of the corresponding HNFM$_i$ obtained for each bootstrap value threshold $b_i$ by averaging specificity and sensitivity, $\mathcal{R}_i = (Sn_i + Sp_i)/2$ [30]. Note that we have defined sensitivity and specificity in terms of the nodes of the dendrogram $D_i$ which are preserved in $D_{ik}$. In an equivalent way $Sn_{ik}$ and $Sp_{ik}$ can be defined in terms of the preserved factors in the corresponding models, HNFM$_i$ and HNFM$_{ik}$, i.e. the factors which determine the dynamics of exactly the same variables in both models. $\mathcal{R}_i$ can be interpreted as the probability, averaged over all factors of the HNFM$_i$, that a HNFM$_{ik}$ contains a factor which is also present in the HNFM$_i$. Removing factors from the HNFM reduces the quantity of the empirical variance explained by the model. Therefore a satisfying bootstrap value threshold corresponds to the minimal value of $b_i$ such that $\mathcal{R}_i$ is larger than some standard threshold of reliability, e.g. 95% or 99%. In the example shown in Fig. 2 (discussed below) $\mathcal{R}_i > 95\%$ for $b_i \ge 80\%$. Finally, it should be noted that no assumption about the data distribution is needed to implement the method.

FIG. 2: $\mathcal{R} = (Sn + Sp)/2$ as a function of the bootstrap value threshold. The error bar is one standard deviation. The dashed line indicates the chosen threshold of statistical reliability.
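The node removal and the reliability measure described above can be illustrated as follows. This is a hedged sketch with hypothetical inputs: the bootstrap values, the simulated reduced dendrograms and the function names are all assumptions made for the example. Nodes are represented by their sets of leaves, so that removing a node automatically attaches its descendants to the first retained ancestor; sensitivity and specificity then reduce to counting shared leaf sets.

```python
import math

def reduce_nodes(bootstrap_value, threshold):
    """Keep the nodes whose bootstrap value is not smaller than the threshold.

    Nodes are frozensets of leaves. The root (largest leaf set) is always kept.
    """
    root = max(bootstrap_value, key=len)
    return {node for node, value in bootstrap_value.items()
            if value >= threshold or node == root}

def sensitivity_specificity(reduced_original, reduced_simulated):
    """Preserved nodes divided by the node count of each reduced dendrogram."""
    preserved = len(reduced_original & reduced_simulated)
    return (preserved / len(reduced_original),
            preserved / len(reduced_simulated))

def reliability(reduced_original, simulated_list):
    """Average of mean sensitivity and mean specificity over the simulations."""
    pairs = [sensitivity_specificity(reduced_original, s) for s in simulated_list]
    mean_sn = sum(p[0] for p in pairs) / len(pairs)
    mean_sp = sum(p[1] for p in pairs) / len(pairs)
    return (mean_sn + mean_sp) / 2

# Hypothetical bootstrap values for a dendrogram of 5 leaves.
root = frozenset({0, 1, 2, 3, 4})
values = {
    root: 1.0,
    frozenset({0, 1, 2}): 0.97,
    frozenset({0, 1}): 0.55,
    frozenset({3, 4}): 0.85,
}
reduced = reduce_nodes(values, threshold=0.8)

# Hypothetical reduced dendrograms obtained from two simulated data sets.
simulated = [
    {root, frozenset({0, 1, 2}), frozenset({3, 4})},
    {root, frozenset({0, 1, 2})},
]
print(len(reduced), reliability(reduced, simulated))
```

In a full analysis the list `simulated` would be produced by simulating the reduced model, clustering each simulated data set and applying the same threshold, and the computation would be repeated for every threshold value.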

We have concluded above that the matrix $C^<$ obtained by applying some hierarchical clustering technique to a correlation matrix is positive definite, provided that its elements are non negative numbers. Of course the same holds true for the correlation matrix of the HNFM reduced according to the described bootstrap technique.

As an application of the described technique to real data we examine a system monitored by recording the set of daily equity returns of N = 100 highly capitalized stocks traded at the New York Stock Exchange (NYSE) during the period 1995-1998 (T = 1011). We apply the ALCA to the correlation matrix of the system and we obtain the dendrogram shown in Fig. 3. The dendrogram has N - 1 = 99 nodes. The statistical reliability of these nodes differs from node to node due to metric and topological characteristics. The metric properties depend on the correlation coefficient values whereas the topological characteristics depend on the ranking of these values and therefore on the complexity and number of hierarchies of the system.

FIG. 3: Dendrogram of the set of daily equity returns of 100 highly capitalized stocks traded at the NYSE during the period 1995-1998 obtained by applying the ALCA to the correlation matrix. Colors are chosen according to the stock economic sector. Specifically these sectors are Basic Materials (violet), Consumer Cyclical (tan), Consumer Non Cyclical (yellow), Energy (blue), Services (cyan), Financial (green), Healthcare (gray), Technology (red), Utilities (magenta), Transportation (brown), Conglomerates (orange) and Capital Goods (light green).

We use the bootstrap technique described above in order to evaluate the statistical reliability of each node and to simplify the description in terms of a HNFM. In particular, we select the minimal bootstrap value threshold that guarantees a value of $\mathcal{R}_i > 95\%$. We accordingly reduce the number of factors of the corresponding HNFM. In our investigation, the number of bootstrap replicas is 1000 and the number of simulations performed for each bootstrap value threshold is n = 20. Simulated time series have been constructed by using original data. In Fig. 2 we plot $\mathcal{R}_i$ as a function of the bootstrap value threshold. A direct inspection shows that the bootstrap value threshold $b_i = 80\%$ guarantees that $\mathcal{R}_i > 95\%$. The corresponding reduced dendrogram has 23 nodes and it is reported in Fig. 4.

Let us first comment on the properties of the reduced HNFM. In the figure we observe several clusters and sub-clusters. As already noticed in previous studies [..., 24], the detected clusters and sub-clusters overlap in part with economic classifications such as the one provided by Forbes magazine. This can be seen in Figs. 3 and 4, where we use this classification to characterize each stock with a specific color. For example, financial firms are represented in Figs. 3 and 4 as green lines in the hierarchical tree. One prominent example is the group of financial stocks. For illustrative purposes, let us consider the equations of the financial elements of the reduced HNFM. The first three stocks from left to right of the group labeled as F in Fig. 4 are described by the equation $x_i^F(t) = \gamma_{\alpha_{19}} f^{(\alpha_{19})}(t) + \gamma_{\alpha_7} f^{(\alpha_7)}(t) + \sum_{h=1}^{2} \gamma_{\alpha_h} f^{(\alpha_h)}(t) + \eta_F \epsilon_i(t)$. The factor $f^{(\alpha_1)}(t)$ is common to all stocks and $f^{(\alpha_2)}(t)$ is common to all stocks except one, with tick symbol HM, which is a gold company. The factor $f^{(\alpha_{19})}(t)$ is specific to these financial stocks (their tick symbols are BAC, JPM and MER). The other six financial stocks also belonging to the same group (indicated by the tick symbols AGC, AIG, AXP, ONE, WFC and USB) are described by the equations $x_i^F(t) = \gamma_{\alpha_7} f^{(\alpha_7)}(t) + \sum_{h=1}^{2} \gamma_{\alpha_h} f^{(\alpha_h)}(t) + \eta_F \epsilon_i(t)$. In this last case only the $f^{(\alpha_7)}(t)$ factor is present in addition to the $f^{(\alpha_1)}(t)$ and $f^{(\alpha_2)}(t)$ factors common to all financial stocks. Since the factor $f^{(\alpha_7)}(t)$ determines the dynamics of only financial stocks (9 out of 10 in the investigated sample), it is natural to consider $f^{(\alpha_7)}(t)$ as a factor characterizing financial stocks, whereas $f^{(\alpha_{19})}(t)$ is an additional factor further characterizing only the three stocks BAC, JPM and MER. A similar organization in nested clusters is observed in all the groups detected by the reduced HNFM. The number of factors characterizing the various stocks ranges from one to five. It is worth noting that each group of stocks sharing at least 3 factors is homogeneous with respect to the economic sector.

It is also worth comparing Figs. 3 and 4. The comparison shows that the self-consistent reduction of the number of factors allows a robust statistical validation of the groups that are detected from the data analysis. Only the information which is statistically robust at the 95% level is retained in the reduced HNFM. For example, the energy cluster observed in Fig. 3 (blue lines in the figure) is not robust at the selected confidence level, whereas the two clusters indicated as E1 and E2 in Fig. 4, corresponding to the sub-sectors Oil well services and equipment and Oil and gas integrated, are robust. In Fig. 4 all the detected clusters of more than 2 elements and consistent with the Forbes classification are indicated by rectangles at the bottom of the figure. The economic characterization of clusters is discussed in the figure caption.

FIG. 4: Dendrogram with 23 internal nodes obtained by node reduction of the ALCA dendrogram (shown in Fig. 3) of 100 stock daily returns traded at the NYSE during the period 1995-1998. Rectangles at the bottom indicate 9 clusters and symbols label the classification of stocks in terms of economic sectors or sub-sectors according to the classification of Forbes magazine. Specifically, E1 is the sub-sector of Oil well services and equipment and E2 is the sub-sector of Oil and gas integrated. Both E1 and E2 belong to the economic sector of Energy; T and F indicate the economic sectors of Technology and Financial respectively; H indicates the sub-sector Major drugs of the economic sector Healthcare; BM indicates a cluster of stocks within the Basic Material economic sector. S1 and S2 indicate the two sub-sectors of Communication services and Retail of the sector of Services respectively. Finally, U represents the sub-sector Electric utilities of the sector Utilities. Colors are chosen according to the stock economic sector as described in the caption of Fig. 3 and the ordering of the stocks is the same as in Fig. 3. The labeled internal nodes are discussed in the text. In the figure we do not comment on clusters composed of only two leaves.

In summary, we have introduced a method for associating a hierarchical factor model with a multivariate data set. The factor model retains all the information about hierarchies extracted from data by a hierarchical clustering procedure. We have also provided a bootstrap-based procedure to obtain the HNFM with the largest number of factors compatible with a predefined threshold of their statistical reliability. This procedure selects in a self-consistent way the optimal bootstrap threshold for the considered set of data. We have also shown that the similarity matrix $C^<$, which is the output of hierarchical clustering procedures, is the proper correlation matrix of our model and therefore it is positive definite. Finally, we have used the HNFM to model a financial system of 100 highly capitalized stocks traded at NYSE. This empirical analysis has shown the ability of HNFM in the modeling of a complex system characterized by nested levels of hierarchies inferred from data.
I. APPENDIX 

In the present supplementary material we introduce 
two simple time series models and we compare how 
straightforward spectral methods and hierarchical meth- 
ods are able to unveil the hierarchical properties of the 
models. The two models are hierarchically nested factor 
models. 

As a first example, we consider a model (already introduced in [31]) in which the N variables follow a common factor $f_0(t)$ and two other factors $f_1(t)$ and $f_2(t)$ which affect two distinct groups of $n_1$ and $n_2 = N - n_1$ elements respectively. The equations of the model are

$$x_i(t) = \gamma_0 f_0(t) + \gamma_1 f_1(t) + \eta_1 \epsilon_i(t), \quad \forall i \le n_1,$$
$$x_i(t) = \gamma_0 f_0(t) + \gamma_2 f_2(t) + \eta_2 \epsilon_i(t), \quad \forall i: n_1 < i \le N, \qquad (1)$$

where $\gamma_0$, $\gamma_1$, $\gamma_2$ and $\eta_i$ ($i = 1, 2$) are parameters. In this equation the factors $f_i(t)$ and the terms $\epsilon_i(t)$ are independent noise terms with zero mean and unit variance. We consider again variables $x_i$ with zero mean and unit variance without loss of generality. This choice fixes the value of $\eta_i$. We set $\rho_{\alpha_1} = \gamma_0^2$, $\rho_{\alpha_2} = \gamma_0^2 + \gamma_2^2$ and $\rho_{\alpha_3} = \gamma_0^2 + \gamma_1^2$. The eigenvalue spectrum of the correlation coefficient matrix of this model has two large eigenvalues given by $\lambda_\pm = [2 + q_+ \pm (q_-^2 + 4 n_1 n_2 \rho_{\alpha_1}^2)^{1/2}]/2$, where $q_\pm = (n_1 - 1)\rho_{\alpha_3} \pm (n_2 - 1)\rho_{\alpha_2}$, $n_1 - 1$ eigenvalues equal to $1 - \rho_{\alpha_3}$ and $n_2 - 1$ eigenvalues equal to $1 - \rho_{\alpha_2}$. Thus, despite the fact that the original factor model of Eq. (1) has three uncorrelated factors $f_i(t)$ ($i = 0, 1, 2$), the spectrum has only two large eigenvalues. One could be tempted to interpret these large eigenvalues and the corresponding eigenvectors as describing the collective dynamics or the dynamics of the two groups. By analyzing the eigenvectors, it can be seen that this is not the case. The eigenvectors of the two largest eigenvalues have intra-group degenerate components and neither the first nor the second eigenvector is in general proportional to the vector {1, 1, ..., 1} representing the common behavior driven by the factor $f_0(t)$. Similarly, when one attempts to associate the first two eigenvectors with the two groups, one is faced with the fact that the first two eigenvectors have all non vanishing components. Our model indicates that the association between eigenvectors and factors is correct only in the limit when the system can be divided in groups of variables and each group is driven only by one factor. The generalization of the model to the case of heterogeneous $\gamma$ parameters and/or the finiteness of empirical time series makes the task of associating factors with eigenvectors even more involved when the correlation matrix of the model has hierarchical features.
On the other hand, by applying a hierarchical clustering procedure, e.g. single linkage, average linkage or complete linkage, to the correlation matrix of the model of Eq. (1) one obtains the hierarchical tree of Fig. 5A. The corresponding HNFM coincides with the model of Eq. (1). We have verified that by applying the bootstrap method introduced in our paper we recover the HNFM of Eq. (1) also when we take into account the role of a finite number of records of multivariate time series (in our simulations we set T = 1011 and N = 100). Moreover, simulations have been performed by assuming the variables $x_i(t)$ to be either Gaussian distributed or Student-t distributed with 4 degrees of freedom. In both cases the recovered HNFM is the same and coincides with the model of Eq. (1). The threshold of reliability used to reduce the number of factors in the HNFM is $\mathcal{R} = 95\%$ and the hierarchical clustering algorithm used is the Average Linkage Cluster Analysis (ALCA). This result shows that our method based on a hierarchical clustering procedure is able to recover the structure of the HNFM whereas basic spectral methods such as, for example, the principal component analysis are unable to uncover it. More specialized spectral methods, such as the varimax and the promax (or oblique rotation) methods [32], are in most cases also unable to transform the eigenvectors associated with the two large eigenvalues of the model of Eq. (1) in such a way that each eigenvector has non-vanishing components only for the variables belonging to one of the two groups.
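The statements about the spectrum of the first model can be checked directly on a small instance. The sketch below is only illustrative: the group sizes and correlation levels are hypothetical, and the function names are assumptions. It builds the block correlation matrix with within-group correlations $\rho_{\alpha_3}$ and $\rho_{\alpha_2}$ and between-group correlation $\rho_{\alpha_1}$, evaluates $\lambda_\pm$ from the closed formula, constructs the corresponding intra-group degenerate vectors, and measures the residual of the eigenvalue equation.

```python
import math

def block_correlation(n1, n2, rho1, rho2, rho3):
    """Correlation matrix of the two-group model (rho3 in group 1, rho2 in group 2)."""
    n = n1 + n2

    def entry(i, j):
        if i == j:
            return 1.0
        if i < n1 and j < n1:
            return rho3
        if i >= n1 and j >= n1:
            return rho2
        return rho1

    return [[entry(i, j) for j in range(n)] for i in range(n)]

def large_eigenpairs(n1, n2, rho1, rho2, rho3):
    """Eigenvalues lambda_+ and lambda_- with vectors of the form (u,...,u,v,...,v)."""
    q_plus = (n1 - 1) * rho3 + (n2 - 1) * rho2
    q_minus = (n1 - 1) * rho3 - (n2 - 1) * rho2
    root = math.sqrt(q_minus ** 2 + 4 * n1 * n2 * rho1 ** 2)
    pairs = []
    for sign in (1, -1):
        lam = (2 + q_plus + sign * root) / 2
        u = n2 * rho1                      # component shared by group 1
        v = lam - 1 - (n1 - 1) * rho3      # component shared by group 2
        pairs.append((lam, [u] * n1 + [v] * n2))
    return pairs

def residual(matrix, lam, vec):
    """Largest absolute component of (matrix @ vec - lam * vec)."""
    return max(abs(sum(row[j] * vec[j] for j in range(len(vec))) - lam * vec[i])
               for i, row in enumerate(matrix))

# Hypothetical parameters: n1 = 3, n2 = 4, rho_a1 = 0.2, rho_a2 = 0.5, rho_a3 = 0.7.
params = (3, 4, 0.2, 0.5, 0.7)
matrix = block_correlation(*params)
pairs = large_eigenpairs(*params)
for lam, vec in pairs:
    print(round(lam, 4), round(vec[0], 4), round(vec[-1], 4),
          residual(matrix, lam, vec))
```

In this instance both vectors have non vanishing components in both groups and neither has equal components across the two groups, in line with the discussion above.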

The second example we wish to consider is again a 3-factor model but with a completely nested structure. The equations of the model are:

$$x_i(t) = \gamma_0 f_0(t) + \gamma_1 f_1(t) + \gamma_2 f_2(t) + \eta_1 \epsilon_i(t), \quad \forall i \le n,$$
$$x_i(t) = \gamma_0 f_0(t) + \gamma_2 f_2(t) + \eta_2 \epsilon_i(t), \quad \forall i: n < i \le 2n,$$
$$x_i(t) = \gamma_0 f_0(t) + \eta_3 \epsilon_i(t), \quad \forall i: 2n < i \le 3n = N, \qquad (2)$$

and, as in the previous case, we consider random variables with zero mean and unit variance. The dendrogram associated with this model is shown in Fig. 5B. The eigenvalue spectrum of the correlation matrix has 3 large eigenvalues and 3 small eigenvalues each one with degeneracy $n - 1$. The most general case is analytically solvable but the eigenvalues and eigenvectors cannot be expressed in a compact way. Thus here we set $\gamma_0 = \gamma_2 = \sqrt{\rho}$ and $\gamma_1 = \sqrt{2\rho}$, so that the correlation is $4\rho$ within the first group, $2\rho$ within the second group and between the first two groups, and $\rho$ otherwise. With these simplifying parameters, the model of Eq. (2) depends only on the parameters $n$ and $\rho$. The space spanned by the eigenvectors of the 3 largest eigenvalues is the space of vectors $z = \{z_1 = u, \ldots, z_n = u, z_{n+1} = v, \ldots, z_{2n} = v, z_{2n+1} = w, \ldots, z_N = w\}$, i.e. the space of vectors with intra-group degenerate components. When $n\rho \gg 1$, the first 3 eigenvalues are $\lambda_1 = (3 + \sqrt{7}) n\rho$, $\lambda_2 = n\rho$ and $\lambda_3 = (3 - \sqrt{7}) n\rho$. Since the components of the corresponding eigenvectors are defined only in terms of $u$, $v$ and $w$, we represent eigenvectors as characterized by 3 parameters by using the formalism $s = \{u, v, w\}$. It results that the non normalized eigenvectors are $s_1 = \{8 + 3\sqrt{7}, 5 + 2\sqrt{7}, 3 + \sqrt{7}\}$, $s_2 = \{-1, 1, 1\}$ and $s_3 = \{3\sqrt{7} - 8, 2\sqrt{7} - 5, \sqrt{7} - 3\}$. This result implies that also in this case the first 3 eigenvalues are associated with eigenvectors with degenerate non vanishing intra-group components. Moreover, none of these eigenvectors is proportional to the vector {1, 1, 1} representing the common behavior driven by the factor $f_0(t)$. On the other hand, by applying the ALCA to the correlation matrix of the model of Eq. (2) one obtains the dendrogram of Fig. 5B. The HNFM corresponding to this dendrogram coincides with the model of Eq. (2).

FIG. 5: A) Dendrogram associated with the model of Eq. (1). B) Dendrogram associated with the model of Eq. (2).
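The eigenvalues and eigenvectors quoted for the second model can be verified with a few lines of arithmetic. The sketch below rests on an explicit assumption: the squared loadings are taken as $\gamma_0^2 = \gamma_2^2 = \rho$ and $\gamma_1^2 = 2\rho$, which gives correlations $4\rho$, $2\rho$ and $\rho$ between the groups as encoded in the matrix `M`. In the limit $n\rho \gg 1$ the action of the correlation matrix on a vector $\{u, v, w\}$ reduces to $n\rho$ times `M`, so it suffices to check the 3 x 3 eigenvalue equations. The names `M`, `pairs` and `residual` are introduced here for illustration.

```python
import math

# Reduced matrix acting on (u, v, w), in units of n * rho (limit n * rho >> 1).
# Rows and columns refer to the three groups of the nested model.
M = [
    [4, 2, 1],
    [2, 2, 1],
    [1, 1, 1],
]

s7 = math.sqrt(7)
# Eigenvalues (in units of n * rho) and non normalized eigenvectors from the text.
pairs = [
    (3 + s7, [8 + 3 * s7, 5 + 2 * s7, 3 + s7]),
    (1.0, [-1.0, 1.0, 1.0]),
    (3 - s7, [3 * s7 - 8, 2 * s7 - 5, s7 - 3]),
]

def residual(matrix, lam, vec):
    """Largest absolute component of (matrix @ vec - lam * vec)."""
    return max(abs(sum(row[j] * vec[j] for j in range(3)) - lam * vec[i])
               for i, row in enumerate(matrix))

def proportional_to_ones(vec, tol=1e-9):
    """True if all components are equal, i.e. the vector is along {1, 1, 1}."""
    return max(vec) - min(vec) < tol

for lam, vec in pairs:
    print(round(lam, 4), residual(M, lam, vec), proportional_to_ones(vec))
```

The three eigenvalues sum to 7, the trace of `M`, which is a quick consistency check on the assumed loadings.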

In summary, these two examples of HNFM show that it is not always possible to associate the largest eigenvalues of the correlation matrix either with specific groups of elements or with all elements. It is also worth noticing that in the first example we have found 2 large eigenvalues in a system driven by 3 factors whereas in the second case we have observed 3 large eigenvalues for a model with 3 factors. This means that there is no direct relation between the number of factors in the HNFM and the number of large eigenvalues of the corresponding matrix $C^<$. These results indicate that standard spectral methods are not always suitable for the analysis of systems in which hierarchies are present.